Questions and answers
Were there expression values that were not unique for specific genes? How did you handle these?
Were there expression values that could not be mapped to current HUGO symbols?
Download all the require packages and reference the libraries
readr allows the program to read large datasets.
biomaRt is a Bioconductor package that implements the RESTful API of biomart, the annotation framwork for model organism genomes at the EBI. It is a Bioconductor package, and as such it needs to be loaded via the BiocManager,
Reference the libraries
Check if there is any duplicated genes or features
Mapping via biomaRt The data that I am working contains HUGO gene symbols so I am going to map it to Ensembl ID
Take a look at the new mapped table
Check duplicates
To fix those duplicates - Find them
Make sure there is no empty values
Only 6 duplicates so we can check it on the ensembl gene portal
According to Ensembl gene portal - LINC00856 is an external reference matched to LINC00595 - C12orf74 is an external reference matched PLEKHG7 - We can remove both
Save the mapping
Group all the subtypes/ features
Compare read counts in different breast tumour subtypes
Now, we can see the distrubution of the data
Boxplot - with data from the original data before normalization
MDS Plot - Compare between all subtypes
Density Plot
The of the most dynamic cell populations found in turmous are leukocytes. They play a very important role in normal breast tissue remodeling during pregnancy and involution. This study performed a gene expression analysis in leukocytes sorted from normal breast tissues, ductal carcinomas in situ (DCIS), and HER2+ and triple negative invasive ductal carcinomas (IDC). It focuses on the immune escape during breast tumor progression.
About the dataset
This study performed RNA-seq on purified CD45+ T cells from normal breast tissues (n=12),DCIS (n=11), and IDC (n=12), focusing on HER2+CD3+ (n=5) and TN (n=6) IDC cases. RNA was isolated from purified CD45+CD3+ leukocytes by cell sorting.
Source name resected breast tumor Organism Homo sapiens Characteristics tissue of origin: breast (tumor) cell type: CD45+CD3+ T cell breast tumor subtype: DCIS/ IDC/ None
Results from A1 Retrieved dataset(GSE87517) from GEOquery.Checked if there is any duplicated genes and features in the dataset. Since the dataset contains HUGO symbols so I mapped it with ENSEMBL gene id. Gathered all the information from BioManager::useMart and mapped it with our dataset. Only 6 duplicates existed in the mapping therefore I manually filtered out the duplicates by checking them on the ensembl gene portal. Normalized the data and filtered out low counts data. Look at the distribution of the dataset by Boxplot, MDS plot and Density plot. The final coverage of my dataset is 26440 out of 27011 observed values (missing ~2.2% of the dataset)
Install all the required packages
Reference the packages
Look at the most common genes of the same subfamily in our dataset
CD8A - According to Entrez Gene Summary: - CD8A acts as a coreceptor with the T-cell receptor on the T lymphocyte to recognize antigens
CTLA-4 : immune checkpoint proteins - mentioned in the results of the article
signaling pathways were upregulated in IDC compared to DCIS
CTLA-4 were also more abundant in T cells from IDCs than in DCIS,
Use edgeR to perform differential expression analysis as edgeR is designed for analyzing RNA-seq data. using Quasi liklihood method to get the estimation of allowing overdispersion and this method can fit data exhibiting overdispersion using fully specified probability models
Design the model for the anaylsis
Take a look at the MDS plot
According to above MDS Plot, setting the magnitude of log-fold change of at least 1 and less than one seems reasonable.
Calculate differential expression
Make a contrast table explicitly comparing normal genes and affected (dcis and idc) genes
Plot the heatmap to compare the expressions of normal/ unaffected genes and affected genes.
Up regualted genes
Down regulated genes
Save all genes lists
To perform a thresholded gene set enrichment analysis, we can either go on the web interface (https://biit.cs.ut.ee/gprofiler/gost) to run g:Profiler or use the gprofiler2 package.
Using the function gost() to retrieve the functional enrichment analysis of gene lists. The annotation data and the version of the annotation I am using is:
GO:MF – releases/2019-07-01
GO:CC – releases/2019-07-01
GO:BP – releases/2019-07-01
KEGG – KEGG FTP Release 2019-09-30
REAC – annotations: ensembl
classes: 2019-10-2
WP – 20190910
TF – annotations: TRANSFAC Release 2019.1
classes: v2
MIRNA – Release 7.0
HPA – annotations: HPA website: 2017-12-01
classes: script: 2018-03-19
CORUM – 03.09.2018 Corum 3.0
HP – hpo.annotations.monthly #157
The of the most dynamic cell populations found in turmous are leukocytes. They play a very important role in normal breast tissue remodeling during pregnancy and involution. This study performed a gene expression analysis in leukocytes sorted from normal breast tissues, ductal carcinomas in situ (DCIS), and HER2+ and triple negative invasive ductal carcinomas (IDC). It focuses on the immune escape during breast tumor progression.
About the dataset
This study performed RNA-seq on purified CD45+ T cells from normal breast tissues (n=12),DCIS (n=11), and IDC (n=12), focusing on HER2+CD3+ (n=5) and TN (n=6) IDC cases. RNA was isolated from purified CD45+CD3+ leukocytes by cell sorting.
Source name resected breast tumor Organism Homo sapiens Characteristics tissue of origin: breast (tumor) cell type: CD45+CD3+ T cell breast tumor subtype: DCIS/ IDC/ None
Results from A1 Retrieved dataset(GSE87517) from GEOquery.Checked if there is any duplicated genes and features in the dataset. Since the dataset contains HUGO symbols so I mapped it with ENSEMBL gene id. Gathered all the information from BioManager::useMart and mapped it with our dataset. Only 6 duplicates existed in the mapping therefore I manually filtered out the duplicates by checking them on the ensembl gene portal. Normalized the data and filtered out low counts data. Look at the distribution of the dataset by Boxplot, MDS plot and Density plot. The final coverage of my dataset is 26440 out of 27011 observed values (missing ~2.2% of the dataset)
Install all the required packages
Reference the packages
library(Biobase)
library(ComplexHeatmap)
library(circlize)
library(gprofiler2)
Non-thresholded gene set enrichment analysis is performed using GSEA v4.0.3. The text file of the ranked-list is generated from A2. The rank is calculated by log(diff_exp\(PValue,base =10) * sign(diff_exp\)logFC).
head(diff_exp)
## x logFC logCPM F PValue FDR
## 1 A1BG 2.9400748 2.4450740 8.82681786 3.774597e-03 0.0219439121
## 2 A1BG-AS1 7.5598692 6.6977064 24.30061395 3.568651e-06 0.0002682045
## 3 A1CF -0.1565448 0.2504336 0.02953652 8.639183e-01 0.9191069107
## 4 A2M -2.1776375 4.4929525 4.57689425 3.501582e-02 0.1052300017
## 5 A2M-AS1 -3.7651267 3.5712110 13.24770546 4.475751e-04 0.0051791160
## 6 A2ML1 1.4551196 3.5230584 1.86643993 1.751669e-01 0.3226719275
## rank
## 1 2.42312941
## 2 5.44749588
## 3 -0.06352734
## 4 -1.45573569
## 5 -3.34913411
## 6 0.75654800
#upregulated genes
head(up_genes)
## [1] A1BG A1BG-AS1 AADACL3 AADAT AAK1 AANAT
## 27011 Levels: A1BG A1BG-AS1 A1CF A2M A2M-AS1 A2ML1 A2MP1 A3GALT2 A4GALT ... ZZZ3
#downregulated genes
head(down_genes)
## [1] A2M A2M-AS1 AADACL4 AAMDC ABCA1 ABCA9
## 27011 Levels: A1BG A1BG-AS1 A1CF A2M A2M-AS1 A2ML1 A2MP1 A3GALT2 A4GALT ... ZZZ3
Methods: GSEA v4.0.3 for Windows
Genesets: Genesets from the baderlab geneset collection from February 1, 2020 containing GO biological process, no IEA and pathways
Version: v4.0.3
Results:
Top 10 upregualted genes:
The complement system is made up of a large number of distinct plasma proteins that react with one another to opsonize pathogens and induce a series of inflammatory responses that help to fight infection.
Top 10 downregulated genes:
Compare results:
From A2, we perfromed a enrichment analysis using g:profiler. From there, the top upgregulated are: cellcular response to cytokine stimulus, response to cytokine, cytokine-mediated signaling pathway, cellcular response to tumor necrosis facr and organelle envelope. The result from g:profiler mostly consists of cellular and biological response to cytokine which is quite different from the results from GSEA.
Enrichment Analysis is perfromed using Cytoscape 3.7.2 and the parameters are the followings:
Analysis type: GSEA Enrichment pos: gsea_report_for_na_pos_1585770708151.xls ENrichment neg: gsea_report_for_na_neg_1585770708151.xls GMT: Human_GOBP_AllPathways_no_GO_iea_February_01_2020_symbol.gmt Ranks: ranked_gene_list_na_pos_versus_na_neg_1585770708151.txt Expression: normalized_data.txt Phenotype: na_pos, na_neg FDR : 0.9
All the files are used in Cytoscape are generated from GSEA except normalized_data.txt. normalized_data.txt is the differential expression that is calculated in A2 and is inputed in a text file. It consists of all the gene names and their LogFC value.
There are 463 nodes and 1469 edges showing in the network.
The geneset with the highest positive NES:
The geneset with the lowest negative NES:
Annotated cluster: Using the application in Cytoscape, AutoAnnotate, to create annotation set
The parameters are the followings: Cluster Alogrithm: MLC cluster Edge Weight column: simlarity_coefficient Label Column: GS_DESCR Label Algorithm: WordCloud: Adjacent Words Max Word per lable: 3 Adjacent word bonus: 8
The annotation network has 90 clusters. The range size of the clusters is from 2 nodes to 33 nodes.
The top 3 largest upregulated cluster: mrna translation termination il2 1pathway il nucleus trna export
1 major downregulated cluster: chemokine chemotaxis migration
Putting the annoatated network into theme Go to AutoAnnotate panel to Create Summary Network, then select Clusters and Unclustered Nodes
Now, in the network, we have 189 nodes and 33 edges. 34 of all nodes are not clustered together and the rest are clusted and linked together.
From the GSEA reault, we can see MATRIX METALLOPROTEINASES pathway is in the top 10 upgregulated list. The MMP family and their inhibitor play a very important role multiple biological functions in all stages of cancer development. The stages of cancer developement include initiation to outgrowth of clinically relevant metastases and likewise in apoptosis and angiogenesis. MMPs and their inhibitors are crucital to many researchers and are investigated to be an anticancer drug
Import the network from Public database, searched the pathway by Wikipathway and entered WP129
Import the differential expression data (normalized_data.txt) to the pathway Change the color of the enzymes according to the expression level from normalized_data.txt
MMP3 and TIMP4 show the highest expression value and TCF20 shows the lowerst expression value
In the article, MMPs are not mentioned in the paper but from the results from A2, it is one of the most abundant family expressed in the dcis and IDC features. MMP3 involves in the breakdown of extracellular matrix in normal physiological processes, such as embryonic development, reproduction, and tissue remodeling, as well as in disease processes, such as arthritis and metastasis.As TIMP4 is the inhibitor of the matrix metalloproteinases which involves in degradation of the extracellular matrix. On the other hand, TCF20 is downregualted and is transcriptional coactivator and can enhance the activity of transcription factors. Mutation on this protein can lead to autism spectrum disorders which is less related to our article.